
INRIA 



INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE 






Scalable Distributed Video-on-Demand: 
Theoretical Bounds and Practical Algorithms 

Laurent Viennot, Yacine Boufkhad, Fabien Mathieu, Fabien de Montgolfier, Diego Perino









No. 6496

April 2008
Theme COM





Rocquencourt




Theme COM: Communicating systems
Project GANG

Research report no. 6496, April 2008, 19 pages



Abstract: We analyze a distributed system where $n$ nodes called boxes store a large set of videos and collaborate
to serve simultaneously $n$ videos or less. We explore under which conditions such a system can be scalable
while serving any sequence of demands. We model this problem through a combination of two algorithms: a
video allocation algorithm and a connection scheduling algorithm. The latter plays against an adversary that
incrementally proposes video requests.

Our main parameters are: the ratio $u$ of the average upload bandwidth of a box to the playback rate of a
video; the maximum number of connections $c$ used for downloading a video; the number $m$ of distinct videos
stored in the system, i.e. its catalog size. In a homogeneous system (i.e. all node capacities are equal) where a
box downloads its video with no more than $c$ equal rate connections, we give necessary conditions for achieving
scalable catalog size. In particular, we prove for that case a lower bound $u \ge \max\{1 + \frac{1}{c}, \mu\}$, where $\mu > 1$ is
the maximum growth factor of any swarm of boxes viewing the same video during a period of time equivalent
to start-up delay (our model tolerates swarms growing exponentially with time). On the other hand, we prove
that catalog size $\Omega(n)$ can be achieved with a centralized scheduling algorithm when $u \ge \max\{1 + \frac{1}{c}, \mu\}$, $c \ge 2$
and nodes are reliable.

Additionally, we propose a distributed connection scheduling algorithm associated with a random video
allocation scheme for heterogeneous systems where box upload capacity is proportional to storage capacity. It
achieves catalog size $\Omega(n/\log n)$ and can successfully handle a sequence of $O(n)$ adversarial events with
high probability as long as $u > \mu + \frac{1}{c}$. As a special case, it can be used to solve single video distribution with
$O(1)$ reliable seed boxes, or $O(\log n)$ unreliable seed boxes, with constant capacities.
Scaling distributed video-on-demand services

Summary: We consider a system of $n$ nodes (boxes) that host a set of movies and aim to deliver
up to $n$ simultaneous video streams. The question is under which conditions such a
system can scale while supporting any sequence of requests. The problem
splits into two parts: the initial placement of videos in the boxes and the allocation of resources
in response to requests. For the latter problem, we assume that an adversary issues requests
incrementally.

The main parameters of the problem are: the ratio $u$ between the average upload of the boxes and the bitrate
needed to play a video; the maximum number $c$ of connections usable to retrieve a
video; the number $m$ of distinct movies stored in the system (the catalog size).

In a homogeneous system (all boxes have the same capacities), we give, for fixed $c$, the
conditions necessary for building a system that can scale. In particular, we show that
$\max\{1 + \frac{1}{c}, \mu\}$ is a lower bound for $u$, where $\mu > 1$ is the maximum growth factor of the set
of requests for a video during a period of time on the order of the start-up delay of a video
(our model thus tolerates exponential growth of requests).

Conversely, we prove that if $u \ge \max\{1 + \frac{1}{c}, \mu\}$, $c \ge 2$ and the boxes are reliable, then a
catalog size $m = \Omega(n)$ is possible with a centralized allocation algorithm.

Finally, we propose a distributed allocation algorithm, associated with a random placement algorithm
suited to heterogeneous systems where the upload capacity of boxes is proportional to their storage
capacity. It is then possible to serve a sequence of $O(n)$ adversarial requests with high probability,
with a catalog size in $\Omega(n/\log n)$, provided that $u > \mu + \frac{1}{c}$.

Keywords: video-on-demand, scalability, peer-to-peer
Figure 1: Generic box description, and possible Video-on-Demand architectures. Panels include (c) Cached servers, (d) Peer-assisted, (e) Fully distributed.



1 Introduction 



1.1 Background 

The quest for scalability has yielded a tremendous amount of work in the field of distributed systems in the last
decade. Most recently, the peer-to-peer community has grown up on the extreme model where small capacity
entities collaborate to form a system whose overall capacity grows proportionally to its size. Historically, the first
peer-to-peer systems were devoted to collaborative storage (see, e.g., |11 | 122 } fT3]). The academic community 
has proposed numerous distributed solutions to index the contents stored in such a system. Most prominently,
one can mention the numerous distributed hash table proposals (see, e.g., [2T j [23 l [20 l l24|). Extreme attention 
has then been paid to content distribution. There now exist efficient schemes for single file distribution [5].
Several proposals were made to cooperatively distribute a stream of data (see, e.g., [3 [TT1 [26l [2T|, [T31 [10])- The 
main difficulty in streaming is to obtain low delay and balanced forwarding load. Most recently, the problem of 
collaborative video-on-demand has been addressed. It has mainly been studied under the single video distribution 
problem: how to collaboratively download a video file and view it at the same time |15 [ l6ll3ll2^[T4 ^ [T2 ^ [7l[T6 l l9]. 
This somehow combines both file sharing and streaming difficulties. On the one hand, participants are interested
in different parts of the video. On the other hand, an important design goal resides in achieving a small start-up
delay, i.e. the delay between the request for the video and the start of playback. 

Most of these solutions rely on a central server for providing the primary copy of a video to the set of entities 
collaboratively viewing it. Following the pioneering idea of Suh et al. [25], we propose to explore the conditions 
for achieving fully distributed scalable video-on-demand systems. One important goal is then to enable a large 
distributed catalog, i.e. a large number of distinct primary video copies distributively stored. We thus consider 
the entities storing the primary copies of the videos as part of the video-on-demand system. This model can 
encompass various architectures like a centralized system with download-only clients, a peer-assisted server as 
assumed in many proposed solutions, a distributed server with download-only clients or a fully distributed system 
as proposed in [25]. These scenarios are illustrated by Figure 1. The fully distributed architecture is mainly
motivated by the existence of set-top boxes placed directly in user homes by Internet service providers. As these 
boxes may combine both storage and networking capacities, they become an interesting target for building a 
low cost distributed video-on-demand system that would be an alternative to more centralized systems. 



1.2 Related Work 

A significant amount of work has been done on peer-assisted video-on-demand, where there is still a server (or 
a server farm) which stores the whole catalog. Annapureddy et al. [3] investigate the distribution (on-demand) 
of a single video. They propose an algorithm that uses a combination of network coding, segment scheduling 
and overlay management in order to handle high stream rates and low start-up delays even under flash crowd
scenarios. This follows an approach similar to [H] consisting in grouping viewers of the same segment of the 
video together. Adaptations of the BitTorrent protocol to single video distribution are proposed in [16, 17].
Cheng et al. [6] propose connections to nodes at different positions in the video to enable VCR-like features
(seeking, fast-forwarding, ...). A thorough analysis of single video distribution under Poisson arrivals is made
in [15], where strategies for pre-fetching of future content are simulated against real traces. Caching strategies are
tested against real traces in [2]. It is proposed in [TH] to use a distributed hash table to index videos cached by 
each node. However, there is no guarantee that the videos stay in cache. All these solutions rely on a centralized
server for feeding the system with primary copies of videos. 

To the best of our knowledge, only a few attempts have been made so far to investigate the possibility of a 
server-free video-on-demand architecture. Suh et al. proposed the Push-to-Peer scheme [25] where the primary 
copies of the catalog are pushed on set-top boxes that are used for video-on-demand. The paper addresses the 
problem of fully distributing the system (including the storage of primary copies of videos), but scalability of
the catalog is not a concern. Indeed, a constant size catalog is achieved: each box stores a portion of each video.
A code-based scheme is combined with window slicing of the videos and pre-fetching of every video. The paper
is mainly dedicated to a complex analysis of queuing models to show how low start-up delay and sufficiently fast 
download of videos can be achieved. The system is tailored for boxes with upload capacity lower than playback 
rate. As we will see, this is a reason why scalable catalog cannot be achieved in this setting. 

Finally, in a preliminary work [1], we began to analyze the conditions for catalog scalability. This work
mainly focuses on the problem of serving pairwise distinct videos with a distributed system with homogeneous
capacities and no node failure. Most notably, an upper bound of $n + O(1)$ is shown for catalog size when upload
is too scarce. A distributed video-on-demand system is sketched based on pairwise distinct requests and using any
existing single video distribution algorithm for handling multiply requested videos. We extend this work much
further, to multiple requests, the heterogeneous case and node churn scenarios. We can now provide an upper
bound of $o(n)$ for catalog size when upload is scarce and multiple requests are allowed. Secondly, we prove that
the maximum flow technique proposed for pairwise distinct requests can be extended to answer any demand with 
possible multiplicity. This requires a much more involved proof. Additionally, we give insight on heterogeneous 
systems where nodes may have different capacities one from another. Finally, we propose a distributed algorithm 
combining both primary video copy distribution and replication of multiply requested videos. Let us now give 
more details about the contributions of the present paper. 

1.3 Contribution 

This paper mainly proposes a model for studying the conditions that enable scalable video-on-demand. Most 
importantly, we focus on scalable catalog size and scalable communication schemes. Our approach consists 
in first formulating necessary requirements for scalability and then trying to design algorithms based on these
minimal assumptions. We call boxes the entities forming the system. Most notably, we require that a box
downloads a video using a limited number $c$ of connections. This is a classical assumption for having a scalable
communication maintenance cost in an overlay network. Note that efficient $n$-node overlay network proposals
usually try to achieve $c = O(\log n)$. Equivalently, we assume that video data and video stream cannot be
divided into infinitely small units. With at most $c$ connections, a single connection should have rate at least $\frac{1}{c}$,
where 1 corresponds to the normalized playback rate of the video. Similarly, as connections have to remain steady
during long periods of time with regard to the start-up delay $t_s$, a box should store portions of video data of size at
least $\frac{t_s}{c}$. These assumptions of a minimal unit of data or a minimal connection rate provided by a box of the system
are particularly natural when one faces the problem of distributing video data on several entities: one has to
define some elementary chunk size and distribute one or more of them per entity.

We first show that these discrete nature assumptions on connection rates and chunk size give rise to an
upload bandwidth threshold. If the average upload $u$ is no more than 1, scalable catalog size cannot be achieved;
a minimal average upload of $1 + \frac{1}{c}$ is thus required. Theorem 2 states this as soon as $c = O(n^{\varepsilon})$ for any $\varepsilon < \frac{1}{2}$
(e.g., $c$ is constant or bounded by a poly-logarithmic function of $n$). Moreover, a distributed video-on-demand
system cannot achieve scalable catalog size if the number of arrivals for a given video increases too rapidly. We
call swarm of a video the set of boxes playing it. If the swarm of a video can increase by a multiplicative factor
$\mu > 1$ during a period equivalent to the start-up delay $t_s$, then it is necessary to have upload $u \ge \mu$ to replicate
the video data sufficiently quickly (see Theorem 1). These lower bounds on $u$ mainly rely on the assumption
that with large catalog size, some video must be replicated on a limited number of boxes. (This assumption
may be deduced from our bound $c$ on the number of connections or may be taken for itself.)

On the other hand, we give algorithms for enabling scalable video-on-demand. We model the algorithmic part 
of a video-on-demand system with two algorithms: a video allocation algorithm is responsible for placing video 
data on boxes, and a scheduling algorithm is responsible for managing video requests proposed by an adversary, 
i.e. propose connections for each box to download its desired video. We build two scheduling algorithms based 
on random allocation of video data. Let us first remark that it is not possible to resist node failures if some
video has its data on a limited number of boxes: an adversary can place node failure events on these boxes 
and then request the video. We thus propose a first scheduler under the assumption that no node fails and 
that we meet the conditions $u \ge \max\{1 + \frac{1}{c}, \mu\}$ and $c \ge 2$. The problem of finding suitable connections for
Symbol   Meaning
$n$      Number of boxes for serving videos.
$m$      Number of videos stored in the system (catalog size).
$d_i$    Storage capacity of box $i$ (in number of videos).
$d$      Average storage capacity of boxes.
$k$      Number of duplicate copies of a video with random allocation ($k \approx nd/m$).
$u_i$    Upload capacity of box $i$ (in number of full video streams).
$u$      Average upload capacity of boxes.
$c$      Maximum number of connections for downloading a video.
$s$      Number of stripes of videos (a video can be viewed by downloading its $s$ stripes simultaneously).
$a$      Minimum ratio of active boxes in a homogeneous system.
$t_s$    Start-up delay: maximum delay to start playing a video.
         Maximum number of arrivals during $t_s$ for a video not being played.
$\mu$    Bound on swarm growth: if a swarm has size $p$ at time $t$, it has size less than $\mu p$ at time $t + t_s$.

Table 1: Key parameters

downloading all videos reduces to a maximum flow problem for a given set of requests and a given allocation of
videos. We thus propose a centralized scheduler running a maximum flow algorithm. Although a centralized tracker for
orchestrating connections has already been proposed in several peer-to-peer architectures, it is not clear whether
this maximum flow computation could be made in a scalable way. The benefit of this algorithm is thus mainly
theoretical: it helps to understand the nature of the problem. Theorem 3 states that a random allocation
enables a catalog of size $\Omega(n)$ and makes it possible to manage any infinite sequence of adversarial requests with high
probability (as long as the adversary cannot propose node failures). The problem of scalable video-on-demand
can thus be solved with optimal upload capacity in theory. Interestingly, this scheme shows that the
best catalog size is obtained when the storage capacity of boxes is proportional to their upload capacity.
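To make the reduction to maximum flow more concrete, here is a minimal sketch in Python. It is a deliberately simplified illustration, not the scheduler analyzed in this paper: it ignores playback caching and swarm growth, and only asks whether every requested stripe can be downloaded from a box storing it within upload capacities. The names `max_flow`, `can_serve`, `stored` and `upload_slots` are hypothetical, and upload capacity is assumed to be expressed as an integer number of stripe streams per box.

```python
from collections import defaultdict, deque


def max_flow(cap, src, dst):
    """Edmonds-Karp maximum flow; `cap` holds residual capacities, updated in place."""
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {src: None}
        queue = deque([src])
        while queue and dst not in parent:
            x = queue.popleft()
            for y, residual in cap[x].items():
                if residual > 0 and y not in parent:
                    parent[y] = x
                    queue.append(y)
        if dst not in parent:
            return total
        # Walk back from the sink to collect the edges of the path.
        path, y = [], dst
        while parent[y] is not None:
            path.append((parent[y], y))
            y = parent[y]
        bottleneck = min(cap[a][b] for a, b in path)
        for a, b in path:
            cap[a][b] -= bottleneck
            cap[b][a] += bottleneck
        total += bottleneck


def can_serve(requests, stored, upload_slots, s):
    """Check a set of requests against a static allocation (simplified model).

    requests:     list of (viewer_box, video) pairs
    stored:       box -> set of (video, stripe) pairs held by that box
    upload_slots: box -> number of stripe streams the box can upload
    s:            number of stripes per video
    """
    cap = defaultdict(lambda: defaultdict(int))
    demand = 0
    for r, (viewer, video) in enumerate(requests):
        for j in range(s):
            if (video, j) in stored.get(viewer, set()):
                continue  # stripe already available locally
            demand += 1
            need = ("need", r, j)
            cap["src"][need] = 1
            for box, stripes in stored.items():
                if box != viewer and (video, j) in stripes:
                    cap[need][("box", box)] = 1
    for box, slots in upload_slots.items():
        cap[("box", box)]["dst"] = slots
    # All requests are feasible exactly when every needed stripe gets a source.
    return max_flow(cap, "src", "dst") == demand
```

In this sketch a unit of flow from a needed stripe to a box corresponds to one connection of rate $1/s$, so a box with upload $u_i$ would be given $u_i s$ slots.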

Additionally, we propose a randomized distributed scheduler based on priority to playback caching, i.e.
relying on the fact that boxes playing a video can redistribute it. Giving priority to such connections makes the
scheduler resilient to exponential swarm growth. We show that with the random allocation of $\Omega(n/\log n)$ videos in
a system where average storage capacity is $d = \Omega(\log n / c)$ per box, this scheduler can manage $O(n)$ realistic
adversarial events with high probability under the assumption that $u > \mu + \frac{1}{c}$ and the adversary is not aware
of the scheduler and allocation algorithm choices (see Theorem 4). Interestingly, our use of playback caching
makes it possible to build disjoint forwarding trees for video data in a way similar to Splitstream [5]. The main difference is
that relaying nodes buffer data before forwarding it and tree levels are ordered according to the playing position
in the video.

The paper is organized as follows. Section 3 states the requirements that are needed for the catalog to be
scalable. Section 4 investigates the worst case analysis of the problem with no failures, while Section 5 considers
more realistic conditions. Then Section 6 confirms the results of previous sections by means of
simulations. Some proofs are given in the appendix due to space limitations. We now introduce our model for
video-on-demand systems and the notations used throughout this paper.

2 Model 

We first introduce the key concepts of video-on-demand systems and discuss the associated parameters. We 
first describe the nodes (often called boxes) of the system, then detail how they may connect to each other to 
exchange data. We then explain how we decompose the algorithmic part of the system and describe adversary 
models for testing our algorithms. 

Video system. We consider a set of $n$ boxes used to serve videos among themselves. Box $i$ has storage capacity
of $d_i$ videos and upload capacity equivalent to $u_i$ video streams. For instance if $u_i = 1$, box $i$ can upload exactly
one stream (we suppose all videos are encoded at the same bitrate, normalized at 1). Such a system will be
called an $(n, u, d)$-video system where $u = \frac{1}{n}\sum_{i=1}^{n} u_i$ is the average upload capacity and $d = \frac{1}{n}\sum_{i=1}^{n} d_i$ is the
average storage capacity. A system is homogeneous when $u_i = u$ and $d_i = d$ for all $i$. Otherwise, we say it is
heterogeneous. The special case when storage capacities are proportional to upload capacities (i.e. $d_i = \frac{d}{u} u_i$ for
all $i$) is called proportionally heterogeneous.
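The definitions of an $(n, u, d)$-video system can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `VideoSystem` and its method `kind` are hypothetical, capacities are given as exact fractions, and the average upload is assumed to be positive.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import List


@dataclass
class VideoSystem:
    """An (n, u, d)-video system described by its per-box capacities."""
    upload: List[Fraction]   # u_i, in full video streams
    storage: List[Fraction]  # d_i, in number of videos

    @property
    def n(self) -> int:
        return len(self.upload)

    @property
    def u(self) -> Fraction:
        # Average upload capacity u = (1/n) * sum of u_i.
        return sum(self.upload, Fraction(0)) / self.n

    @property
    def d(self) -> Fraction:
        # Average storage capacity d = (1/n) * sum of d_i.
        return sum(self.storage, Fraction(0)) / self.n

    def kind(self) -> str:
        """Classify the system following the definitions of the model."""
        if all(x == self.u for x in self.upload) and all(x == self.d for x in self.storage):
            return "homogeneous"
        ratio = self.d / self.u  # assumes u > 0
        if all(di == ratio * ui for ui, di in zip(self.upload, self.storage)):
            return "proportionally heterogeneous"
        return "heterogeneous"
```

For example, two boxes with uploads 1 and 3 and storage 2 and 6 give $u = 2$, $d = 4$ and $d_i = 2 u_i$, so the system is proportionally heterogeneous.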
The box activity is defined as a state. Box $i$ is active when it can achieve a stable upload capacity no less
than $u_i$, or inactive otherwise (e.g. when it has failed or is turned off by the user). We suppose that the ratio $a$
of active boxes remains roughly constant. We assume that the nodes with higher capacity are not more prone
to failure than the other nodes, so the average upload capacity of active boxes remains larger than $u$. An active
box may be playing when it downloads a video or idle otherwise. The set of boxes playing the same video $v$ is
called a swarm. Node churn occurs as a sequence of events, each changing the state of a box. Swarm churn
designates the events concerning a given swarm. We will see in Section 3.1 that scalability cannot be achieved
when a swarm grows too rapidly. We thus assume a bounded growth factor $\mu$: during a period of time $t_s$ ($t_s$
is defined below), the size of a swarm is multiplied by a factor $\mu$ at most. More precisely, and to remove any
quantification issues, we assume that the number of events for a given swarm and a given period of time $t$ is at
most $\mu^{t/t_s}$. (For convenience, we aggregate the various types of swarm churn within the same bound.)
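As a rough illustration of the bounded growth assumption, the following Python sketch computes the largest swarm size allowed after a given time when the size is multiplied by at most $\mu$ per period $t_s$. The function name `max_swarm_size` is hypothetical, and rounding the elapsed time up to whole periods is a simplifying assumption of this sketch.

```python
import math


def max_swarm_size(p0: int, mu: float, t: float, t_s: float) -> int:
    """Upper bound on the size of a swarm of initial size p0 after time t.

    The swarm is assumed to grow by a factor of at most mu during each
    period of length t_s; partial periods are counted as full ones.
    """
    periods = math.ceil(t / t_s)
    return math.floor(p0 * mu ** periods)
```

With $\mu = 2$, a single viewer can turn into a swarm of 1024 boxes after ten start-up delays, which is the kind of exponential growth the model tolerates.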

Connections. We assume that finding, establishing and setting up a small buffer for starting video playback
takes time. We call start-up delay the maximal duration $t_s$ for a box to connect to other boxes and begin
playback. We consider that the number $c_n$ of connections for downloading a video is bounded by some constant
$c$. The reason is that with constant swarm churn rate, a box will have to change $\Omega(c_n)$ connections per unit of
time. As changing a connection has some latency $O(t_s)$, this number should remain bounded or grow very slowly
with $n$. In connection with this assumption, we suppose that the data of a video cannot be split in infinitely small
pieces. We thus consider that a connection has minimal rate $\frac{1}{c}$ (this is obviously the case when connection
rates are equally balanced, and it can be modeled by aggregating unitary connections otherwise). Therefore
the minimal piece of video data stored on a box is $\Omega(\frac{1}{c})$ (a trivial lower bound of $\frac{t_s}{c}$ follows from previous
assumptions).

A peer-to-peer video system without any external video sourcing relies on the possibility to replicate a video 
as it becomes more popular and the number of requests for the video increases. The most straightforward way 
to do this is to cache in each box the video it is currently playing, which is natural if we want to provide some 
VCR functionalities. We call playback caching this facility: boxes of a swarm can serve as a relay for the
boxes viewing an earlier part of the video. Note that in order to bring some flexibility in the swarm, the video
can be split into time windows, which allows non-linear viewing. Time windowing also reduces
the problem to the case where all videos have approximately the same duration.

Video data manipulations. We consider that all videos have the same playback rate, same size and same
duration (all three equal to 1 as they are taken as reference for expressing quantities). To enable multi-source
upload of a video, each video may be divided in $s$ equal size stripes using some balanced encoding scheme.
The video can then be viewed by downloading simultaneously the $s$ stripes at rate $1/s$. A very simple way of
achieving striping consists in splitting the video file in a sequence of small packets. Stripe $i$ is then made of the
packets with number equal to $i$ modulo $s$. Note that our connection number limitation imposes $s \le c$. There are
two main reasons for using stripes: it allows building internal-node-disjoint trees as discussed in Section 5, and
it lets a box upload sub-streams of rate $\frac{1}{s}$ to fully use its upload capacity. Stripes may also enable redundancy
through correcting codes at the cost of some upload overhead: downloading $(1 - \epsilon)s$ stripes is then sufficient to
decode the full video stream (e.g. using LT-codes [18] or rateless encoding [19]). For the sake of simplicity, we
assume that $s$ can be large enough to consider all $u_i s$ and $d_i s$ as integers. As mentioned previously, a video
can be distributed among several boxes by splitting it according to time windows. However, considering all the
time windows of all videos being played at a given time, we are back to the same problem fundamentally. For
that reason, we do not develop time windowing.
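The simple striping scheme described above (stripe $i$ holds the packets whose number equals $i$ modulo $s$) can be sketched as follows. The function names are hypothetical, packets are assumed to be numbered from 0, and no redundancy or correcting code is applied.

```python
from typing import List, Sequence


def split_into_stripes(packets: Sequence[bytes], s: int) -> List[List[bytes]]:
    """Split a packet sequence into s stripes: stripe i gets packets i, i+s, i+2s, ..."""
    if s < 1:
        raise ValueError("the number of stripes must be at least 1")
    return [list(packets[i::s]) for i in range(s)]


def merge_stripes(stripes: List[List[bytes]]) -> List[bytes]:
    """Rebuild the original packet sequence from all of its stripes."""
    s = len(stripes)
    total = sum(len(stripe) for stripe in stripes)
    # Packet number j sits in stripe (j mod s) at position (j div s).
    return [stripes[j % s][j // s] for j in range(total)]
```

Since consecutive packets go to different stripes, each stripe carries about a fraction $1/s$ of the stream, so the $s$ stripes can be downloaded in parallel at rate $1/s$ each.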

Video scheme. A video allocation algorithm is responsible for placing primary copies of each video in the 
system respecting storage capacity constraints of boxes. The simplest scheme consists in storing them
statically: video data may be replicated but primary copies of videos are static. Video allocation only changes 
when new boxes are added to the system or when the catalog is updated. For instance, re-allocating primary 
copies under node churn would not be practical when live connections consume most of the upload capacity 
of the system. We assume that the catalog renewal is made at a much larger time scale. Its size and storage 
allocation are thus considered fixed during a period of several playback times. We may assume that the catalog 
remains the same during such periods. Of course as the system evolves over a long period of time, some videos 
are added or removed. The catalog size is the number of distinct videos allocated. 

When a box state changes, a scheduling algorithm decides how to update the connections of playing boxes
so that their video is downloaded at rate greater than 1 and all box upload capacities are respected. For
our theoretical bounds (Section 4), we use a centralized scheduler that has full knowledge of the system. For
practical algorithms (Section 5), we consider distributed scheduling algorithms: each time a box changes its state,
it runs the scheduling algorithm on its own. The scheduling algorithm succeeds if it can establish connections
to download the full video stream in time less than $t_s$.

We call video scheme a combination of an allocation scheme and a scheduling algorithm. We say that a 
video scheme achieves catalog size m if the allocation scheme can store m videos in the system so that the 
scheduling algorithm succeeds in handling all requests of an adversary. The adversary knows the list of videos 
in the catalog and proposes any sequence of node state changes that respects our model assumptions. In its 
weaker form, it is not aware of the decisions made by the allocation and scheduling algorithms. This is a realistic 
assumption as there is no reason for user requests to be correlated to something the users of the system are 
not aware of. Worst case analysis is obtained with the strong adversary which is the most powerful adversary 
possible. It is additionally aware of the choices made by the allocation and scheduling algorithms. In particular, 
it knows which boxes contain replicas of a given video. If not specified, the adversary is not strong. 

3 Necessary Conditions for Catalog Scalability 

Let us first give some trivial requirements. The total upload is at most $un$ and, as all active boxes may be
playing, the total download capacity needed may be $n$, so we trivially deduce the lower bound $u \ge 1$. As the
total storage space of active boxes can be as low as $adn$ (assuming that average storage capacity of active boxes
remains $d$), we have $m \le adn$.

Let us remark that if we relax the constraint on bounded connectivity, then ideal storage of $adn$ videos
can theoretically be achieved in any proportionally heterogeneous $(n, 1, d)$-video system when $c = n$. As stated
in the homogeneous case [25], full striping can achieve this. It consists in splitting each video in $n$ stripes, one
per box. Viewing a video then requires connecting to all other boxes. This result can easily be generalized to
the proportionally heterogeneous case with node failures using correcting codes. Such schemes are impractical
for large $n$ but give a theoretical solution.

On the other hand, we show that some upload provisioning is necessary in our more realistic model. The 
main hypothesis implying these results is that some video is replicated on O (^) boxes at most, i.e. o(n) if 
catalog scales. First note that as soon as a video spans at most $o(n)$ boxes, the system cannot tolerate $n$ strong
adversarial events. Indeed, the strong adversary can propose failure events on all boxes possessing a given video 
and then propose a request with the video. 

3.1 Maximal Swarm Churn Rate 

We now state that arrival rate in a given swarm must be lower than average upload. This is our first non trivial 
lower bound on average upload. 

Theorem 1 Any homogeneous (n,u,d) -video system achieving catalog size m and resilient to swarm growth \i 
satisfies u > max {2, /i — O (-M } 

For small start-up delay, a realistic value of $\mu$ would certainly be less than 2. Scalable catalog size is then
achievable for $u \ge \mu$ only.

Proof.We consider a scenario where boxes are viewing different videos, and all of them switch to the same 
video forming a swarm with growth factor [i. The swarm of the video has thus size vs at time 0, vsfi at time 
ts, v sH 2 at time 2ts, and more generally size vs/J? at time its- We choose a video that is replicated at most 
k = O (^) times in the system. If this data is possessed by sufficiently many boxes, it can be replicated k 
times initially. Consider the number of times Xi the data of the video is replicated outside the swarm at time 
its ■ Suppose that all boxes possessing the video either serve new arrivals or pro-actively replicate it with their 
remaining bandwidth. We then have vsfi l+1 + x i+ i < vsu^i 1 + (u — l)xi as the video data must be received by 
all boxes in the swarm and boxes outside the swarm that replicate it. Suppose u < u (otherwise the proof is 
already over). We get Xi + \ < (u — l)xi, and thus Xi < (u — l) l k as x$ < k. The former inequality thus gives 

„, i fc / u—l \ ^ ,, Tf „, ^ n „.„ „l,4-„;„ f„„ „• _ 1„„ „ 4-1. „ 4- ~. ~-~, .. k _ ,. ,o I 1 



> il. If u < 2, we obtain for i = log,, n that u > n — — u, — O ( — ). □ 

Additionally, we can prove that a strict upload of 1 is not sufficient even under low pace arrivals. 
3.2 Upload Capacity versus Catalog Size

We thus assume in this section that $c = O(n^{\varepsilon})$ for some $\varepsilon > 0$. This is for example the case when $c$ is a poly-log
of n as often assumed in overlay networks [2T| I23) [24] . (The rest of the paper assumes a constant c) . With this 
bound, we can establish the following trade-off between average upload capacity and achievable catalog size. 

Theorem 2 For any $\varepsilon > 0$, a homogeneous $(n, u, d)$-video system with $u \le 1$ and $c = O(n^{\varepsilon})$ that can play any
demand of $n$ videos in the no failure strong adversary model has catalog size $m = O(n^{1/2+\varepsilon})$.

The above result states that a video system with scarce capacity scales poorly with $n$. As it is valid in the
no failure strong adversary model, it remains valid in the strong adversarial model. With our discrete vision of
connections, it implies that a minimal upload $u \ge 1 + \frac{1}{c}$ is necessary for scalability.

Proof. Suppose there exists $\varepsilon > 0$ with $\varepsilon < \frac{1}{2}$ such that $c \le n^{\varepsilon}$. As discussed in Section 2, we use our
assumption that a box stores no less than $\frac{t_s}{c}$ data of a given video.

Suppose by contradiction that there exists a video system with catalog size m > 'M n 1 / 2 + e > MSy/^, As the 
overall storage capacity is dn, there exists some video v whose data is replicated at most ^ < ^\fn times. As 
useful portion of data of v have size at least ^f, the set E of boxes storing data of v has size at most \^fri. Let 
F = E be its complementary. Set p = \E\ < \\fn and q = \F\. 

Now consider the possible request sequence where all boxes $b_1, \ldots, b_q$ of $F$ successively begin to play $v$
while boxes of $E$ play videos not stored at all among boxes in $E \cup \{b_q\}$. Box $b_i$ can download $v$ from
$E_i = E \cup \{b_1, \ldots, b_{i-1}\}$. Boxes of $E$ can only download from $F' = F \setminus \{b_q\}$.

Suppose that data of $v$ flows from $E$ to $F'$ at rate $p'$ and from $E$ to $b_q$ at rate $p''$. We have $p' + p'' \le p$ since
the overall upload capacity of $E$ is $p$. Data of $v$ flows internally to $F'$ at rate at least $q - 1 - p'$. The remaining
upload capacity to serve $E$ is thus $p' - (1 - p'') \le p - 1$ as $E$ must additionally serve $b_q$ at rate $1 - p''$. This
implies that the number of videos not stored at all on $E \cup \{b_q\}$ is at most $p - 1$. (Otherwise, we have a request
that cannot be satisfied.)

As a box contains data of at most dc distinct videos, we deduce $m \le dc(p + 1) + p - 1 < 4dc\sqrt{n} \le 4d\,n^{1/2+\epsilon}$.
This is a contradiction and we deduce $m = O(n^{1/2+\epsilon})$. □

We deduce from the previous results that $u \ge \max\{1 + \frac{1}{c}, \mu\}$ is a minimal requirement for scalability. We
now show that it is indeed sufficient.

4 Strong Adversary Video Scheme 

We now propose a video scheme achieving catalog size $\Omega(n)$ in the no failure strong adversary model for any
video system with average upload $u \ge \max\{1 + \frac{1}{c}, \mu\}$. It is based on random allocation of video stripes using
s = c stripes per video and uses a maximum flow scheduler.

4.1 Random Allocation 

Random allocation consists in storing k copies of each stripe by choosing k boxes uniformly at random. This
approach was proposed by Boufkhad et al. [4] using a purely random graph with independent choices. This has
the disadvantage of unbalancing the quantity of data stored in each box. We thus prefer to consider a regular
bipartite graph where all storage space is used on all boxes. We could obtain the same bounds for the purely
random graph; the analysis is slightly more complicated in our case.

For the sake of simplicity, we assume k = dn/m is an integer. A regular random allocation consists in
copying each stripe in k boxes such that each box contains exactly ds stripe copies. We model this through a
random permutation $\pi$ of the kms stripe copies into the dns storage slots of the n boxes together: copy i is
stored in slot $\pi(i)$ (the first $d_1 s$ slots fall into the first box, the next $d_2 s$ slots into the second box, and so on).
The best catalog size is obtained for the smallest possible value of k.

We call random allocation scheme the video allocation algorithm consisting in selecting uniformly at random
a permutation $\pi$ and in allocating videos according to it.
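
As an illustration, the regular random allocation can be sketched in a few lines of Python for the homogeneous case. This is only a sketch under stated assumptions: the function name `regular_random_allocation` and its arguments are hypothetical, stripe copies are represented as `(video, stripe)` pairs, and shuffling the list of copies before laying it over the slots plays the role of the random permutation. As in the permutation model, nothing prevents two copies of the same stripe from landing in the same box.

```python
import random

def regular_random_allocation(n, m, d, s, seed=None):
    """Place k = d*n/m copies of each of the m*s stripes into n boxes of d*s slots.

    Hypothetical helper for the homogeneous case; returns one list of
    (video, stripe) copies per box.
    """
    if (d * n) % m != 0:
        raise ValueError("k = d*n/m must be an integer")
    k = d * n // m
    # One entry per stripe copy: k*m*s copies in total, which equals d*n*s slots.
    copies = [(v, j) for v in range(m) for j in range(s) for _ in range(k)]
    rng = random.Random(seed)
    rng.shuffle(copies)  # uniform random permutation of the copies over the slots
    slots_per_box = d * s
    # Slot t belongs to box t // (d*s): the first d*s slots form box 0, and so on.
    return [copies[b * slots_per_box:(b + 1) * slots_per_box] for b in range(n)]
```

Every box ends up exactly full and every stripe has exactly k copies, which is the regularity property used in the analysis.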

4.2 Maximum Flow Scheduler 

We propose a connection scheduler relying on playback caching. Each time a node state changes, a centralized
tracker considers the multiset of stripe requests, i.e. the union of all the video stripes being played (some stripes
may be played multiple times), and tries to match stripe requests against boxes so that box i has degree at most
$u_i s$. We can model this problem as a flow computation in the following bipartite graph between stripe requests
and the boxes storing these stripes. An arc of capacity 1 links every stripe request to all boxes where it is stored
(either through the static allocation scheme or through playback caching). The scheduling algorithm consists
in running a maximum flow algorithm to find a flow from stripe requests to boxes with the following constraints:
each request has an outgoing flow of 1 and box i has an incoming flow of at most $u_i s$.
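
The flow computation above can be sketched as a capacitated bipartite matching solved by augmenting paths, which is equivalent to a maximum flow when every request carries one unit. The code below is a minimal sketch, not the scheduler used in the paper: the function name `schedule` and the dictionaries `holders` (stripe to boxes storing it) and `capacity` (box to its maximum number of stripe uploads, i.e. $u_i s$) are assumptions made for the example.

```python
def schedule(requests, holders, capacity):
    """Assign each stripe request to a box storing the stripe.

    requests: list of stripe identifiers (a multiset, duplicates allowed).
    holders:  dict stripe -> iterable of boxes storing it (hypothetical input).
    capacity: dict box -> maximum number of simultaneous stripe uploads.
    Returns a list giving the serving box of each request, or None if no
    flow saturating all requests exists.
    """
    served_by = [None] * len(requests)   # request index -> box
    load = {b: [] for b in capacity}     # box -> indices of requests it serves

    def augment(r, seen):
        # Look for an augmenting path starting at request r.
        for b in holders.get(requests[r], ()):
            if b in seen:
                continue
            seen.add(b)
            if len(load[b]) < capacity[b]:
                load[b].append(r)
                served_by[r] = b
                return True
            # Box b is full: try to move one of its requests elsewhere.
            for r2 in list(load[b]):
                if augment(r2, seen):
                    load[b].remove(r2)
                    load[b].append(r)
                    served_by[r] = b
                    return True
        return False

    for r in range(len(requests)):
        if not augment(r, set()):
            return None
    return served_by
```

For instance, with two requests for stripe "a" stored on boxes 0 and 1 and one request for stripe "b" stored only on box 0, a capacity of 1 on box 0 forces both "a" requests onto box 1.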

We prove that a random regular graph using $s \le c$ stripes with $u \ge \max\{1 + \frac{1}{c}, \mu\}$ has the following
property with high probability: for any multiset of at most n requests, a flow with the desired constraints
exists. The proof consists in proving that a random regular allocation graph has some expander property with
high probability. A min-cut max-flow theorem then allows us to conclude and state the following theorem.

Theorem 3 Consider a proportionally heterogeneous $(n, u, d)$-video system with $u \ge \max\{1 + \frac{1}{c}, \mu\}$ and
$c \ge 2$. Random regular allocation combined with the maximum flow scheduler achieves catalog size
$\Omega(dn / \log_u d)$ and successfully manages any infinite sequence of strong adversarial events excepting node failures
with high probability.

The proof generalizes in a non trivial manner the proof of [4] that assumes a purely random graph allo- 
cation, pairwise distinct requests and homogeneous capacities. Due to space limitations, the proof is given in 
Appendix A. 

4.3 Heterogeneous Capacities 

As discussed in Appendix A, in the case of heterogeneous capacities, the proof requires the following balance
condition. For every set E of boxes with overall upload capacity $U_E = \sum_{b \in E} u_b$ and overall storage capacity
$D_E = \sum_{b \in E} d_b$, we have for some $u' \ge \mu + \frac{1}{c}$:

$\frac{U_E}{D_E} \ge \frac{u'}{d}$

(The number of copies per stripe in the allocation graph is then $k = O(\log_{u'} d)$.)

Note that u' = u in the proportionally heterogeneous case and that u' < u in general. Having storage 
capacity proportional to upload capacity is thus the best situation to optimally benefit from the box capacities. 

In the general heterogeneous case, a possible random allocation scheme consists in using only storage d' b = 
d^jr for each box b for some u" > u achieving best storage capacity. If box upload capacities are within a 
constant ratio, this will achieve a catalog size within a constant ratio of the balanced scheme. 

4.4 Poor Upload Capacity Boxes 

Special care has to be taken for a heterogeneous $(n, u, d)$-video system where some boxes have upload capacity
smaller than $\mu$. We say that such boxes are poor. The above connection scheduler may be defeated by downloading
the same video on a large set E of such poor boxes, as it may not support exponential growth. This
comes from the fact that the storage space for the video coming from playback caching may get larger than $U_E$.
The above condition on the balance between storage and upload is then violated by playback caching storage.
The general heterogeneous case is reduced to the case where upload capacities are all greater than or equal to $\mu$
thanks to the following lemma. (This is the last step of the proof of Theorem 3.) Due to space limitations, the
proof is given in Appendix A.

Lemma 1 Consider an (n,u,d) -video system A with np boxes of upload less than /i having overall upload 
capacity Up and a video allocation scheme with s stripes satisfying u > fx. There exists an (n,u 7 d+ np ~ n F )- 
video system B with same video allocation and, for each box b, upload capacity u' b satisfying ^ < u' b < Ub, and 
same average upload u, that can emulate any scheme of A in the no node failure strong adversary model. 

The idea behind this reduction is to statically reserve some upload bandwidth of rich boxes for poor boxes.
The average upload of both systems is thus the same. When a poor box b with upload $u_b < \mu$ downloads a
video, it directly downloads $u_b s / \mu$ stripes as in the scheduling of A and downloads the others through relaying
by the rich boxes it is associated with. The rich boxes also insert the stripes they forward into their playback cache.
This explains why more storage capacity is required. The proof is given in Appendix A.

5 Distributed Video Scheme

5.1 Purely Random Allocation 

The videos are stored in the boxes according to a purely random allocation scheme: each stripe of a video is
replicated k times, where s still denotes the number of stripes per video. Each replica is stored in a box chosen
independently at random. Box i is chosen with probability $\frac{d_i}{dn}$. It is possible to add a video in the system as
long as the k chosen boxes have sufficient remaining storage capacity. Such an allocation scheme is qualified as
purely random.
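
A minimal sketch of this purely random allocation is given below. It is an illustration under assumptions, not the authors' code: the helper name `add_video` is hypothetical, storage is counted in stripes, and `weights` holds the total storage capacity of each box so that a box is drawn with probability proportional to its storage.

```python
import random
from collections import Counter

def add_video(placement, free, weights, video, s, k, rng):
    """Try to store k replicas of each of the s stripes of `video`.

    placement: dict video -> {stripe index -> list of k boxes}, updated in place.
    free:      list of remaining storage (in stripes) per box, updated in place.
    weights:   list of total storage per box (box i is drawn proportionally).
    Returns False, leaving everything unchanged, if a chosen box lacks space.
    """
    boxes = range(len(free))
    # Each replica is placed independently of all the others.
    chosen = {j: rng.choices(boxes, weights=weights, k=k) for j in range(s)}
    need = Counter(b for replicas in chosen.values() for b in replicas)
    if any(free[b] < c for b, c in need.items()):
        return False
    for b, c in need.items():
        free[b] -= c
    placement[video] = chosen
    return True
```

Calling `add_video` repeatedly until it returns False gives the catalog that this scheme can store for one random draw.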

5.2 Playback Cache First Scheduler 

We now propose a randomized distributed scheduling algorithm. The main idea of our scheduler is to give
priority to playback cache over allocated videos to allow swarm growth $\mu$. Only one upload connection is
reserved for video allocation uploading. An average upload $u \ge \mu + \frac{1}{c}$ will thus be required. The scheduling
algorithm is split in two parts: stripe searching and connection granting.

Stripe searching is the algorithm run by a box for finding another box possessing a given stripe. This
algorithm relies on a distributed hash table (or any distributed indexing algorithm) to obtain information about
a given stripe. This index allows a box to learn the complete list of boxes possessing the stripe through the
video allocation algorithm and a partial list of boxes in the video swarm (i.e. boxes playing the video of the
stripe). Stripe searching consists in probing the boxes in these lists until a box accepts a connection for sending
the stripe. A connection request includes the stripe requested and the stripe position in the stripe file (i.e.
an offset position indicating the next byte of video data to be received). A box is eligible for a connection if
it has enough video stripe data ahead of that position and if it has sufficient upload capacity. This is
decided by the connection granting algorithm of the box receiving the request. To give priority to playback-cache
forwarding, boxes of the allocation scheme are probed only when the swarm size is less than vs or when
a stripe is downloaded from a video allocation copy less than vs times. To balance upload, several boxes are
first probed at the same time, and an accepting box with the least number of upload connections for the requested
video is selected.

Connection granting is the algorithm run by a box that is probed for a connection request. Suppose box x 
receives a connection request from box y for a stripe of video v. The connection granting algorithm consists in 
the following steps. 

1. If box x is not viewing v and is already uploading the stripe, it refuses. 

2. If box x has sufficient upload capacity, it accepts. 

3. Otherwise, if box x is not playing v , it refuses. 

4. Otherwise, if the stripe position of x for that stripe is not sufficiently ahead of the requested stripe position,
it refuses.

5. Otherwise, if two or more upload connections of box x concern a stripe of a video different from v, x 
selects one of them at random, closes it and accepts box y. 

6. Otherwise, if box x is uploading the same stripe to some box z and the requested stripe position of
y is sufficiently ahead of the stripe position of z, it closes the connection to z and accepts.

7. Otherwise, it refuses. 
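
The seven steps above can be sketched as a single decision function. This is a simplified illustration, not the authors' implementation: the classes `Box` and `Upload`, the function `grant` and the threshold `gap` (standing for "sufficiently ahead") are hypothetical names introduced for the example, and the sketch assumes that a probed box actually holds the requested stripe.

```python
import random
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Upload:
    video: str
    stripe: int
    dest: str
    position: int  # stripe position requested by the downloader

@dataclass
class Box:
    playing: Optional[str]   # video currently played, if any
    capacity: int            # maximum number of upload connections
    position: dict = field(default_factory=dict)  # stripe -> own stripe position
    uploads: list = field(default_factory=list)

def grant(x, y, video, stripe, pos, gap, rng=random):
    """Return True if box x accepts the request of box y (x.uploads is updated)."""
    new = Upload(video, stripe, y, pos)
    same = [c for c in x.uploads if (c.video, c.stripe) == (video, stripe)]
    if x.playing != video and same:              # Step 1
        return False
    if len(x.uploads) < x.capacity:              # Step 2
        x.uploads.append(new)
        return True
    if x.playing != video:                       # Step 3
        return False
    if x.position.get(stripe, 0) < pos + gap:    # Step 4
        return False
    others = [c for c in x.uploads if c.video != video]
    if len(others) >= 2:                         # Step 5: evict one at random
        x.uploads.remove(rng.choice(others))
        x.uploads.append(new)
        return True
    for c in same:                               # Step 6: replace a box behind y
        if pos >= c.position + gap:
            x.uploads.remove(c)
            x.uploads.append(new)
            return True
    return False                                 # Step 7
```

Connection flipping is not modeled here; a refusal simply returns False and leaves the state of x unchanged.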

Note that Steps 4, 5 and 6 can be executed only if box x plays v. Step 6 can be executed only if it uploads
$us - 1$ stripes of video v. (One connection is always reserved to serve allocated stripes.) A simple optimization
in Steps 6 and 7 consists in connection flipping. In Step 6, box x can send to box z the address of box y for
re-connecting, as the stripe position of y is sufficiently ahead of the stripe position of z in that case. Box z can
then probe box y with the same algorithm. In Step 7, box y can be redirected to any box x' downloading v
from x and having a stripe position sufficiently ahead of the stripe position of y. Box y can then probe box x' with
the same algorithm. This way, a box can find its right position according to stripe position in a downloading
tree path of its swarm. Similarly, in Step 4, box y can make a connection flipping with the box from which x
is downloading, and go up the downloading chain until it finds its right position.

Note that this algorithm works in a similar manner to Splitstream [5], which builds parallel multicast trees for each
stripe. The main difference is that each internal node of a tree receives fresh data in a buffer and forwards data
which is at least $t_s$ old. That way, a performance blip within one node will not percolate to all nodes behind
it in the sub-tree. Moreover, this ensures that a node has sufficient time to recover from a parent failure. In
addition, trees are ordered according to stripe position: boxes with the foremost playing position in the video get
closer to the root whereas newcomers in the swarm tend to be in lower tree levels. Another interesting point
is that nodes downloading from a box with spare connections benefit from this free upload capacity
and download at a rate faster than needed, allowing them to fill their buffer.

5.3 Correctness 

We cannot prove the resilience of our video scheme against any sequence of adversarial events. The following
technical assumption is necessary for our proof and appears to be a realistic hypothesis. We assume that a
given stripe is searched at most O(log r) times on boxes storing it through the video allocation scheme. This
requirement is met when the sequence of adversarial events respects the two following conditions. First, a
constant number of swarms are started on a given video (a realistic assumption if we consider a period of a few
playback durations). (There is no restriction on swarm size.) Second, node failures are randomly chosen and a
given box is chosen with probability $p_f < 1/(vs)$. A sequence of r requests is said to be stress-less if it satisfies
these conditions.

Theorem 4 Consider a proportionally heterogeneous (n,u,d) -video system with u> fi+ - and -^ = O(logn). 
For any bound r = 0{n), it is possible to allocate fi(n/logn) videos and successfully manage r adversarial 
stress-less events with high probability. 

To prove this theorem, we analyze a simpler unitary video system which can be emulated by any proportionally
heterogeneous system with the same overall capacities. Again, we choose to use s = c stripes per video
and assume $u \ge \mu + \frac{1}{s}$. We view each box i as the union of $u_i s$ unitary boxes with upload capacity 1/s (one
stripe) and storage capacity $\frac{d_i}{u_i} = \frac{d}{u}$. This reduction is indeed penalizing. Consider two unitary boxes that are
part of the same real box. In the model, stripes stored on one unitary box cannot be uploaded by the other,
whereas the real box could use two upload slots for any combination of two stripes of any of the unitary boxes.
For some parameter k made explicit later on, a random allocation of k replicas per stripe is made according
to the purely random allocation scheme described previously. This is equivalent to supposing that each replica
is stored in a unitary box chosen uniformly at random since the system is proportionally heterogeneous. As
each unitary box has a storage capacity of $\frac{d}{u}$ stripes, Chernoff's upper bound allows us to conclude that purely
random allocation of $\Omega(dsn/u)$ stripe replicas is possible with high probability when $\frac{d}{u} = \Omega(\log n)$. As we will
use k = O(log n), this achieves the required catalog size.

Second, we simplify the scheduler to an algorithm where two schedulers compete. One is allocating cache 
stripe requests within a swarm (i.e. the stripe will be downloaded from a playback caching copy), the other 
is allocating seed stripe requests from the video allocation pool (i.e. the stripe will be downloaded from a 
unitary box possessing it through the random allocation scheme). We consider that both schedulers operate
independently. This is a penalty with regard to practical scheduling, where simple heuristics may considerably
reduce the number of conflicts, but it simplifies the stochastic analysis of the system. The cache scheduler
allocates swarm stripes and has priority: it operates at real box level according to the above algorithm. From the 
unitary box point of view of the seed scheduler, the cache scheduler disables some unitary boxes. If the unitary 
box was uploading some allocated stripe, it is canceled and a seed stripe search is triggered. This is where the 
reservation of one seed stripe per real box is useful in our analysis. A stripe upload connection is canceled when 
the real box has at least two of them. As the cache scheduler cancels one box at random uniformly, a given 
seed stripe is searched at most O(logn) times with high probability. The seed scheduler scans the list of unitary 
boxes possessing the stripe until a free one is found. 

Note that a video request in the real system triggers at most s cache stripe requests and/or s seed stripe
requests. A node failure on a box uploading $us - 1$ cache stripes results in $us - 1$ cache stripe requests. Each
of them may incur a seed request. The worst event is a video zapping, which is equivalent to both events at the
same time¹. r adversarial requests thus result in at most $(u + 1)sr$ seed stripe searches.

Claim 1 All seed stripe searches succeed with probability greater than $1 - O(\frac{1}{n})$.



1 In the video zapping event from video v to video v', the box can indeed continue to upload the data of v, but it cannot
continue to download more data. In the worst case, the buffered data may be scarce for all boxes downloading from the box.

Proof. We take the point of view of the seed scheduler: a unitary box is free if its real box is active, and the
cache scheduler is not using it. As the adversary and the cache scheduler operate independently from stripe
allocation, we make the analysis as if the random choices used for stripe allocation were discovered as seed
requests arrive. In our case, the purely random scheme consists in allocating each replica in a unitary box
chosen uniformly at random. We show that for k = O(log n), each replica is considered at most once with high
probability.

For instance, consider a seed request for a stripe i. Its list of allocated replicas is scanned forward. Each stripe
replica falling in an occupied box is discarded until a replica falls in a free unitary box. As observed before, the
set X of unitary boxes that are either failed or used for playback cache forwarding is chosen independently from
the replica positions. The set Y of unitary boxes uploading seed stripes depends on independent choices
for other stripe replicas. The probability p that a replica of stripe i falls in one of the $t = |X \cup Y|$ occupied
unitary boxes is $p = \frac{t}{usn}$. Considering that the number of active boxes is $n_a \ge an$ and that the average upload
of active boxes remains at least u, we obtain that the number of failed unitary boxes is at most $usn - usn_a$.
As the number of current seed connections is $|Y|$, the number of cache connections is at most $sn_a - |Y|$. We
thus have $|X| \le u(n - n_a)s + sn_a - |Y|$ and $|X \cup Y| \le u(n - n_a)s + sn_a \le usn(1 - (1 - \frac{1}{u})a)$. We thus have

$p \le 1 - (1 - \frac{1}{u})a.$

As r = O(n), the number of stripe requests is at most $\lambda n$ for some $\lambda > 0$. As discussed previously, the
reservation of one stripe for seed connections in real boxes ensures that a given seed connection is discarded
with probability at most $\frac{1}{2}$. A given stripe is thus discarded at most $\log_2(\lambda n^2)$ times by the cache scheduler
with probability at least $1 - \frac{1}{\lambda n^2}$. Similarly, for a given stripe, the event that a box uploading it with a seed
connection fails happens at most O(log n) times with high probability according to our stress-less events
hypothesis. Every stripe is thus discarded at most O(log n) times with high probability. There may be up to vs
seed connections for a given stripe and stress-less events start at most $\lambda' \log n$ swarms on the video of the stripe
for some $\lambda' > 0$. This results in at most $vs\lambda' \log n$ stripe searches. Finally, we note that, with high probability,
every stripe is searched at most $\lambda'' \log n$ times for some constant $\lambda'' > 0$.

The list of replicas of a stripe can thus be seen as a sequence of zeros (when the replica falls in an occupied
unitary box) and ones (when the replica is found). A zero occurs with probability less than p and a one with
probability more than $1 - p$. We can conclude the proof if the number of ones in every stripe list is greater than
$\lambda'' \log n$ with high probability. As random choices for each replica are independent, we conclude using Chernoff's
upper bound that a sequence of $k = O(\log n / (1 - p))$ replicas contains the required number of ones with
high probability. (Including the parameters of the model, we use $k = O(\frac{u}{(u-1)a} \log n)$.) □

Of course, this vision of consuming the list of replicas of a stripe is particular to our proof. In practice, one 
can loop back to the beginning of the list when the end is reached. 
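
The scan of a replica list, including the loop back to the beginning mentioned above, can be sketched as follows. The function name `seed_search`, the list `replicas` (unitary boxes holding the stripe, in allocation order) and the set `occupied` (unitary boxes that are failed or already uploading) are hypothetical names used for this illustration only.

```python
def seed_search(replicas, occupied, cursor=0):
    """Scan the replica list from `cursor`, wrapping around once.

    Returns (box, next_cursor) for the first replica stored in a free
    unitary box, or (None, cursor) if every replica is occupied.
    """
    n = len(replicas)
    for step in range(n):
        idx = (cursor + step) % n
        if replicas[idx] not in occupied:
            return replicas[idx], (idx + 1) % n
    return None, cursor
```

In the proof, each replica is consumed at most once; the wrap-around only matters in practice, when a previously occupied box may have become free again.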

Claim 2 All cache stripe searches succeed with probability greater than $1 - O(\frac{1}{n})$.

Proof. We assume a choice of s such that $u \ge \mu + \frac{1}{s}$. As in Lemma 1, we suppose that a box i with poor
upload capacity $u_i < \mu + \frac{1}{s}$ reserves an upload $\mu + \frac{1}{s} - u_i$ on some richer boxes. (Note that this augments the
probability of failure for the node, a problem we do not try to analyze here.) A rich box forwarding i stripes to a
poor box accepts preferentially connections for these stripes (as for stripes of the video it is playing) up to an 
upload bandwidth of fj/i. 

First consider the case where a box b is entering the swarm (i.e. it requests position 0 in the stripe file).
The swarm Z of v can be decomposed into the set X of boxes arrived in Z before time $t - t_s$ and the set Y of
boxes arrived in Z later on. We thus have $Z = X \uplus Y$ and $b \in Y$. If $X = \emptyset$, then $|Y| \le vs$ according to the
arrival bound of our model and each seed stripe search succeeds with high probability as discussed above. On
the other hand, if $X \ne \emptyset$, we have $|Z| \le \mu|X|$ according to the exponential bound on swarm churn in our model.
The boxes in X have overall upload capacity $(u - \frac{1}{s})|X|$ (including the capacity reserved on richer boxes) and
serve at most $|Z| - \frac{1}{s}$ times the video (box b is still searching for a stripe). As $u - \frac{1}{s} \ge \mu$, some connection slot
is free for accepting the stripe connection of box b. It can always be found if b has the full list of boxes in the
swarm. The fraction of boxes with exceeded capacity for the video is thus at most $\frac{\mu}{u - 1/s}$. Note that a slightly
higher value of $u > \mu + \frac{1}{s}$ would result in a constant fraction of nodes with exceeded capacity for their video.
This would allow finding one with high probability if the list of random nodes in the swarm has length O(log n).
Now consider the case where a box b is reconnecting in its swarm due to some zapping or node failure event.
We can prove similarly that the connection flipping algorithm allows finding a node in the swarm to connect
to. This relies on the hypothesis that the number of reconnecting nodes at position t in a video increases by a
factor $\mu$ at most during a period of time $t_s$ as assumed by our model (all types of swarm churn are aggregated
in the bound $\mu$). □



6 Simulations 

In this section we evaluate the performance of a practical allocation scheme by means of simulations. This
scheme is similar to the one described in Section 5 but it presents two main differences. Firstly, the storage
allocation is based on a random regular graph obtained by a permutation $\pi$ of the kms stripe copies into the dns
storage slots. This choice is motivated by the more practical aspect of regular random allocation, which allows
boxes to be filled completely. Secondly, once the connections are established, they cannot be re-negotiated when
a new video request is performed. The goal is to test the basic functioning of the algorithm to understand
where connection renegotiations become necessary.

We assume that every node has a cache of size 1 where it stores all the stripes of the video it is watching.
We suppose video requests arrive at a constant rate, and $t_s$ = 2 minutes.

As stated in previous sections, the efficiency of an allocation scheme depends on the request pattern. In
the following, we use five kinds of adversarial schedulers to generate video request sequences:

• Greedy adversarial. The greedy adversarial scheduler chooses the request for which the system will
select a node with minimal remaining upload bandwidth (among the set of nodes that can be selected by
a request in the current configuration). This adversary makes greedy decisions. It is strong in the sense
that it is aware of video allocation and current connections.

• Random. The random scheduler selects a video uniformly at random in the catalog.

• Netflix. m videos are randomly selected from the Netflix Prize dataset [1] as catalog for our simulated
system. Requests are performed following the real popularity distribution observed in the dataset.

• Netflix2. The m most popular videos of the Netflix Prize dataset are selected as catalog for our simulated
system. Requests are performed following the real popularity distribution of these m videos.

• Zipf. The scheduler selects videos following a Zipf's law popularity distribution with $\gamma = 2$.

The peers that perform a request follow a sequence of random permutations of the n peers. All our simulations
are performed with n = 100 nodes, and the results are averaged over multiple runs.
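
Two of these request patterns (Random and Zipf) and the peer ordering can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, videos are assumed to be numbered by popularity rank starting from 0, and the Zipf weight of the video of rank r is taken to be $1/r^{\gamma}$. The greedy adversarial and Netflix-based schedulers are not covered, since they need the system state and the dataset respectively.

```python
import random

def zipf_weights(m, gamma=2.0):
    """Unnormalized Zipf popularity: the video of rank r has weight 1 / r**gamma."""
    return [1.0 / (rank ** gamma) for rank in range(1, m + 1)]

def request_sequence(m, length, kind="random", gamma=2.0, seed=None):
    """Draw `length` video requests from a catalog of m videos."""
    rng = random.Random(seed)
    if kind == "random":
        return [rng.randrange(m) for _ in range(length)]
    if kind == "zipf":
        return rng.choices(range(m), weights=zipf_weights(m, gamma), k=length)
    raise ValueError("unknown request pattern: " + kind)

def requesting_peers(n, length, seed=None):
    """Order of requesting peers: a sequence of random permutations of the n peers."""
    rng = random.Random(seed)
    order = []
    while len(order) < length:
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order[:length]
```

Pairing `requesting_peers(n, r)` with `request_sequence(m, r, ...)` element by element gives a sequence of r (peer, video) requests.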

6.1 Impact of the number of copies per video 

We study the maximum number of requests the system is able to satisfy as a function of the number k of copies
per video ($k \approx \frac{dn}{m}$). We suppose that nodes may watch more than one video (for instance if multiple playback
devices depend on a single box) so the total number of requests can be larger than n, even if n is the typical
desired target. We set s = 15, $u = 1 + \frac{1}{s}$ and d = 32. Figure 2 shows that the system is able to satisfy at least
one request per node if k > 6, independently of the request pattern. Moreover, for the Random, Netflix and
Netflix2 schedulers, k > 3 is enough.

We indicate as a reference the maximum number of requests the system can satisfy considering the global
available upload bandwidth. Note that for k > 10, nodes almost fully utilize their upload bandwidth and the
system asymptotically attains the maximum possible number of requests.

6.2 Varying the number of stripes 

We study the impact of the number of stripes into which videos are split. For this purpose, we set k = 10,
d = 32 and $u = 1 + \frac{1}{s}$.

Figure 3 shows that the system can satisfy n requests or more for all schedulers but the adversarial one. With
few stripes, the greedy scheduler may find blocking situations where re-configuration of connections would indeed
be necessary. For low s, more requests are served with the other schedulers. This is not surprising, considering
that a reduction of the number of stripes leads to an increase of the system global bandwidth. As s increases,
u tends toward 1 and the number of satisfied requests tends toward n.

Figure 2: Requests satisfied as a function of k. n = 100, d = 32, s = 15, $u = 1 + \frac{1}{s}$.

Figure 3: Requests satisfied as a function of s. n = 100, d = 32, k = 10, $u = 1 + \frac{1}{s}$.

6.3 Heterogeneous capacities 

We analyze the impact on the number of video requests satisfied in the presence of nodes with different upload
capacities. Node capacity distribution is a bounded Gaussian distribution with $u = 1 + \frac{1}{s}$ and different variance
values. We set k = 10, s = 15 and d = 32. Figure 4 shows the results. Schedulers can satisfy at least n requests
for small or large values of upload variance, with a slight loss of efficiency in between. This may come from the
fact that we do not use a proportional allocation scheme here.

6.4 Node failures 

We evaluate the impact of off-line peers on the number of video requests the system can satisfy. We set k = 10,
s = 15, d = 32 and $u = 1 + \frac{1}{s}$. We then randomly select some nodes and we set them inactive for the simulation.
Figure 5 shows the system can satisfy video requests for at least all the active nodes in the system up to
40% failures (a = 0.6). Then, a drastic decrease in the performance occurs. As soon as there are 10% of boxes
off-line, the adversarial scheduler is able to block the system.

Figure 4: Requests satisfied with heterogeneous capacities. n = 100, d = 32, k = 10, s = 15, $u = 1 + \frac{1}{s}$.

Figure 5: Number of requests satisfied with static off-line peers. n = 100, d = 32, k = 10, s = 15, $u = 1 + \frac{1}{s}$.

7 Conclusion 

In this paper, we show an average upload bandwidth threshold for enabling a scalable fully distributed
video-on-demand system. Under that threshold, a scalable catalog cannot be achieved. Above the threshold, linear
catalog size is possible and the problem of connecting nodes to serve demands reduces to a maximum flow
problem. A slight upload provisioning allows building distributed algorithms achieving scalability.

Appendix A

Maximum flow scheduler 

We prove Theorem 3 thanks to the two following lemmas. For the sake of clarity, the proof is written for the
homogeneous case. It is discussed later on how it generalizes to heterogeneous capacities.

Lemma 2 (Min-cut max-flow) Consider a bipartite graph from U to V and an integer b > 0. There exists a
b-matching where each node of U has degree 1 and each node of V has degree at most b iff each subset
$U' \subseteq U$ has at least $|U'|/b$ neighbors in V (i.e., the graph is a 1/b-expander).

Proof. The 1/b-expander property is clearly necessary. We prove it is sufficient by considering the flow network
obtained by adding a source node a and a sink node z to the bipartite graph. An edge with capacity 1 is added
from a to each node in U. Edges of the bipartite graph are directed from U to V and have capacity 1. An
edge with capacity b is added from each node in V to z. The 1/b-expander property implies that every cut has
capacity at least $|U|$. The well-known min-cut max-flow theorem allows us to conclude. □
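
On very small instances, the 1/b-expander condition of Lemma 2 can be checked by brute force, which may help in building intuition. The sketch below enumerates all subsets of U, so it is exponential and only meant for toy examples; the function name `is_expander` and the adjacency dictionary `adj` are assumptions made for this illustration.

```python
from itertools import combinations

def is_expander(adj, b):
    """Check that every non-empty subset U' of U has at least |U'|/b neighbors.

    adj: dict mapping each node of U to the set of its neighbors in V.
    """
    nodes = list(adj)
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            neighbors = set().union(*(adj[u] for u in subset))
            if len(neighbors) * b < size:
                return False
    return True
```

By Lemma 2, `is_expander(adj, b)` holds exactly when a b-matching covering all of U exists.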

Lemma 3 Consider a random regular permutation graph of kms = dns stripe copies into the dns memory slots
of n boxes. The probability that ki given copies fall into p given boxes with $pds \ge ki$ is less than $\left(\frac{p}{n}\right)^{ki}$.

Proof. Drawing uniformly at random a permutation of the kms = dns stripe copies amounts to choosing uniformly
at random a slot for the first copy, then a slot for the second among the remaining slots, and so on. The ki
copies are ordered. Let $E_a$ denote the event that the a-th copy falls into one of the pds slots of the
p boxes. Then $P(\bigcap_{a \le ki} E_a) = P(E_1) \cdot P(E_2 \mid E_1) \cdots P(E_a \mid E_1 \cap E_2 \cap \ldots \cap E_{a-1}) \cdots = \frac{pds}{nds} \cdot \frac{pds - 1}{nds - 1} \cdots \frac{pds - ki + 1}{nds - ki + 1} \le \left(\frac{p}{n}\right)^{ki}$
(since $\frac{pds - a}{nds - a} \le \frac{p}{n}$ for $p \le n$). □



Proof. [of Theorem 3] We assume that $s \le c$ is sufficiently large to ensure $u \ge 1 + \frac{1}{s}$. We suppose $u \ge \mu$ and
$s \ge 2$. Consider the multiset of stripe requests at some time t. Its size is at most ns as there are no more than
n videos played. Let S be a sub-multiset of size i among the requested stripes. Let $i_1$ be the number of pairwise
distinct requests in S and $i_2 = i - i_1$ be the number of duplicated requests in S. As swarm growth is bounded
by u, there are at least ai 2 nodes where duplicate request can be downloaded with a = j- r 

Let B(S) denote the set of boxes from which any stripe of S may be downloaded. From Lemma 2, a
connection matching for serving the requests can always be found if no multiset S of at most ns requested stripes
verifies $|B(S)| < j$ with $j = \frac{i}{us}$. Note that B(S) includes at least the given boxes where duplicate requests may
be downloaded thanks to playback caching. This represents at least $a i_2$ boxes, and $|B(S)| \ge j$ for $a i_2 \ge j$. We
may thus consider only $a i_2 < i/us$ (implying $i_1 > (1 - 1/aus)i$). By summing over all sets of j = i/us boxes
and using Lemma 3, we get the following bound relying only on the stripe copies placed according to the video
allocation graph (this probability is 0 for i < us): $P(|B(S)| < j) \le \binom{n}{j} \left(\frac{j}{n}\right)^{k i_1} \le \left(\frac{e\,nus}{i}\right)^{i/us} \left(\frac{i}{uns}\right)^{k i_1}$. The last
inequality is obtained by using the standard upper bound of the binomial coefficient ($\binom{n}{j} \le \left(\frac{en}{j}\right)^{j}$).

Using Markov's inequality, the probability that some obstruction multiset $S$ exists for some request is bounded by the expected number of such obstructions. By summing the above inequality over all multisets $S$ of at most $ns$ stripes, we get the following bound on the probability $p$ that the graph cannot satisfy all possible requests:

$$p \le \sum_{i=us}^{ns}\ \sum_{i_1=(1-1/(\alpha us))\,i}^{i} M(i,i_1)\left(\frac{e\,n\,us}{i}\right)^{i/(us)}\left(\frac{i}{nus}\right)^{k i_1},$$

where $M(i,i_1)$ is the number of multisets of cardinality $i$ taken from sets of stripes of cardinality $i_1$. $M(i,i_1)$ is at most

$$M(i,i_1) \le \binom{ms}{i_1}\binom{i-1}{i_1-1} \le \left(\frac{4e\,nds}{i}\right)^{i},$$

since $i_1 \le i$ and considering that $k \ge 1$. Notice also that $(i/(nus))^{k i_1} \le (i/(nus))^{ki - ki/(\alpha us)}$. The probability is then at most

$$p \le \sum_{i=us}^{ns}\frac{i}{\alpha us}\left(\frac{i}{uns}\right)^{\kappa i}\delta^{i} \le \frac{n}{\alpha u}\sum_{i=us}^{ns}\left(\frac{i}{uns}\right)^{\kappa i}\delta^{i},$$

where $\delta = 4d\,e^{1+1/(us)}/u$ and $\kappa = k - k/(\alpha us) - 1/(us) - 1$.

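It is easy to check that, as a function of $i$, the terms of the sum $\phi(i) = (i/(uns))^{\kappa i}\,\delta^{i}$ decrease from $\phi(us)$, reach a minimum at $\phi(i^*)$ with $i^* = uns/(e\,\delta^{1/\kappa})$, then increase to $\phi(ns)$. Using this fact, we bound $p$ by considering separately the sum for $i \le i^*$ and $i > i^*$ and by replacing each term with the maximum term on its side. On one hand,

$$\frac{n}{\alpha u}\sum_{i=us}^{i^*}\left(\frac{i}{uns}\right)^{\kappa i}\delta^{i} \le \frac{n}{\alpha u}\cdot ns\cdot\phi(us) = \frac{n}{\alpha u}\cdot ns\cdot\frac{\delta^{us}}{n^{\kappa us}} \le O\!\left(\frac{1}{n^{\kappa us-2}}\right).$$

On the other hand, the sum of the terms of rank greater than $i^*$ gives

$$\frac{n}{\alpha u}\sum_{i=i^*+1}^{ns}\left(\frac{i}{uns}\right)^{\kappa i}\delta^{i} \le \frac{n}{\alpha u}\cdot ns\cdot u^{-\kappa ns}\,\delta^{ns} \le O\!\left(n^{2}\,(u^{-\kappa}\delta)^{ns}\right).$$

Finally, $p \le O(1/n^{\kappa us-2}) + O(n^{2}(u^{-\kappa}\delta)^{ns})$. For the second term to vanish, we need $u^{-\kappa}\delta < 1$, i.e. $\kappa > \log_u(\delta)$. By the definition of $\kappa$, this requires replicating each stripe $k$ times with $k > (\log_u(\delta) + 1 + 1/(us))/(1 - 1/(\alpha us))$. For the sake of simplicity,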
considering $s > 2$, the lower bound on the number of replicas is $k > 2\log_u(d) + 2\log_u(4e^2) + 1$. In this case the probability of failure is at most $p \le O(1/n^{\kappa us-2})$ (note that $\kappa us - 2 > 0$) and then the bipartite graph can satisfy all possible requests with high probability. Since the number of videos that can be stored is $nd/k$ and given the condition on $k$, the storage capacity is $\Omega(nd/\log_u(d))$.

□
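The shape of the summand used in this proof can be explored numerically. The sketch below assumes the summand has the form $\phi(i) = (i/(uns))^{\kappa i}\delta^{i}$ and that its stationary point is $i^* = uns/(e\,\delta^{1/\kappa})$, obtained by setting the derivative of $\log\phi$ to zero. All parameter values and function names are hypothetical.

```python
import math


def log_phi(i: float, n: int, u: float, s: int, kappa: float, delta: float) -> float:
    """Log of phi(i) = (i / (u*n*s))**(kappa*i) * delta**i."""
    return kappa * i * math.log(i / (u * n * s)) + i * math.log(delta)


def stationary_point(n: int, u: float, s: int, kappa: float, delta: float) -> float:
    """Solve d/di log phi(i) = kappa*log(i/(u*n*s)) + kappa + log(delta) = 0."""
    return u * n * s / (math.e * delta ** (1.0 / kappa))


# Hypothetical parameters (not from the report).
n, u, s = 1000, 2.0, 4
kappa, delta = 8.0, 50.0

first, last = int(u * s), n * s          # range of the sum: i = us .. ns
i_star = stationary_point(n, u, s, kappa, delta)
i_min = min(range(first, last + 1),
            key=lambda i: log_phi(i, n, u, s, kappa, delta))

print(f"i* = {i_star:.1f}, grid minimum at i = {i_min}")
print("u**(-kappa) * delta < 1:", u ** (-kappa) * delta < 1)
```

Because $\log\phi$ is convex in $i$, the largest terms sit at the two ends of the range, which is why the proof bounds each side of $i^*$ by its endpoint term.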

Now consider the heterogeneous case. Lemma 2 and the above proof may be generalized. Recall that box $b$ has storage capacity $d_b$ and upload capacity $u_b$. The condition for an obstruction then becomes $\sum_{b \in B(S)} s\,u_b < |S| = i$. We can then consider any subset $E$ of boxes with overall capacities $U_E = \sum_{b \in E} u_b$ and $D_E = \sum_{b \in E} d_b$ such that $U_E < i/s$. As boxes are chosen according to their capacity, the probability of putting a stripe in $E$ with random allocation is thus $\frac{D_E}{nd} = \frac{D_E\,U_E}{nd\,U_E} < \frac{D_E}{d\,U_E}\cdot\frac{i}{ns}$ for an obstruction. Assuming $d\,U_E/D_E \ge u'$ for all $E$, we can follow the same tracks for the proof, with the probability of obstruction being less than $j/n$ with $j = i/(u's)$. With a
smallest upload capacity of 1 stripe, the total number of such sets $E$ is bounded by $\binom{n}{i}$ instead of $\binom{n}{j}$ in the above proof. This larger factor in the sum over all multisets of requests is not a problem when taking a slightly larger value of $k$. We can thus get similar bounds as long as $d\,U_E/D_E \ge u'$ for some $u' > \max\{1 + 1/c,\ \mu\}$.
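The capacity-proportional placement argument can be illustrated on a toy system. The sketch below is an assumption-laden check, not the report's method: it takes the placement probability of one stripe copy in a set $E$ to be $D_E/(nd)$, enumerates every subset $E$ with $U_E < i/s$ that also satisfies $d\,U_E/D_E \ge u'$, and verifies that the probability stays below $i/(n s u')$. The box capacities and the values of `s`, `i` and `u_prime` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def find_violations(uploads, storages, s, i, u_prime):
    """Check D_E/(n*d) < i/(n*s*u') for every candidate obstruction set E.

    A set E is a candidate when U_E < i/s; the assumption d*U_E/D_E >= u'
    is tested explicitly and sets violating it are skipped.
    """
    n = len(uploads)
    d = sum(storages, Fraction(0)) / n          # average storage capacity
    target = Fraction(i) / (n * s * u_prime)    # j/n with j = i/(u'*s)
    checked, violations = 0, []
    for size in range(1, n + 1):
        for E in combinations(range(n), size):
            U_E = sum((uploads[b] for b in E), Fraction(0))
            D_E = sum((storages[b] for b in E), Fraction(0))
            if U_E >= Fraction(i, s) or d * U_E < u_prime * D_E:
                continue
            checked += 1
            if D_E / (n * d) >= target:         # capacity-proportional placement
                violations.append(E)
    return checked, violations


# Hypothetical proportionally heterogeneous system: storage = 10 * upload.
uploads = [Fraction(x) for x in ("1", "3/2", "2", "2", "3", "4")]
storages = [10 * ub for ub in uploads]
u_prime = sum(uploads, Fraction(0)) / len(uploads)

checked, violations = find_violations(uploads, storages, s=4, i=20, u_prime=u_prime)
print(f"checked {checked} candidate sets, {len(violations)} violations")
```

Exact rational arithmetic is used so that the strict inequality is tested without rounding effects.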

Boxes with capacity lower than $u'$ can be grouped with high upload capacity boxes to obtain the desired
property as proposed in Lemma [TJ 

Poor Upload Capacity Boxes 

Proof. [of LemmaQ] Boxes $b$ with upload capacity $u_b < \mu$ are said to be poor. Boxes with upload capacity exactly $\mu$ are said to be medium. Boxes $b$ with upload capacity $u_b > \mu$ are said to be rich. Let $P$, $M$ and $R$ denote the sets of poor, medium and rich boxes respectively. We set $n_P = |P|$, $n_M = |M|$, and $n_R = |R|$ (note that $n = n_P + n_M + n_R$). Let $u_P = U_P/n_P$ and $u_R = U_R/n_R$ be the mean upload capacities of poor and rich boxes respectively, $U_P$ and $U_R$ being the corresponding overall upload capacities.

We construct $B$ from $A$ with the same video allocation. For the sake of simplicity, we assume that $s/\mu$ is an integral value, as well as $u_b s$ for each box $b$. In a pre-processing step, for each poor box $b$, we reserve $\mu(1 - u_b/\mu)s = \mu s - u_b s$ upload slots from the rich boxes. This upload will be used to forward $(1 - u_b/\mu)s$ stripes to the poor box and to serve new arrivals in the swarm for up to $(\mu - 1)(1 - u_b/\mu)s$ stripes. This assignment should balance the overall number $s_b$ of slots reserved on a rich box $b$ such that its remaining upload capacity $u'_b = u_b - s_b/s$ remains no less than $\mu$. (In a proportionally heterogeneous system, one would typically choose $s_b$ proportional to $u_b$.) A corresponding space of jf£ should additionally be reserved for playback caching. We thus set $d'_b = d_b +$ 2^. This assignment is possible when $U_R - \mu n_R \ge \mu n_P - U_P$, i.e. $u \ge \mu$. Now we use a video allocation scheme for capacities $u'_b$, $d'_b$ (where $u'_b = u_b$ and $d'_b = d_b$ for each poor or medium box $b$). The connection scheduler works as previously except for the download connections of poor boxes. When a poor box $b$ requests a video, the $s$ stripes are downloaded from the boxes decided by the previous scheme. However, $b$ downloads $u_b s/\mu$ stripes directly, while the $(1 - u_b/\mu)s$ others are downloaded via the rich boxes with reserved upload slots for box $b$. These rich boxes participate in the caching of the stripes they forward instead of $b$. This scheme increases the overall upload capacity of the set $E$ of all boxes caching some stripe requested by $p$ boxes, so that $U_E \ge \mu p$. □
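The pre-processing step of this proof can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: capacities are expressed in integer upload slots (`upload_slots[b]` stands for $u_b s$ and `mu_slots` for $\mu s$), the function name `reserve_slots` is hypothetical, and slots are assigned greedily rather than balanced or made proportional to $u_b$ as the text suggests.

```python
def reserve_slots(upload_slots, mu_slots):
    """Reserve slots on rich boxes to cover the deficit of poor boxes.

    upload_slots[b] = u_b * s and mu_slots = mu * s, both integers.
    Returns {rich box index: reserved slots}, or None when the rich
    surplus cannot cover the poor deficit (U_R - mu*n_R < mu*n_P - U_P).
    """
    poor = [b for b, ub in enumerate(upload_slots) if ub < mu_slots]
    rich = [b for b, ub in enumerate(upload_slots) if ub > mu_slots]
    need = sum(mu_slots - upload_slots[b] for b in poor)    # mu*s - u_b*s each
    spare = sum(upload_slots[b] - mu_slots for b in rich)
    if need > spare:
        return None
    reserved = {b: 0 for b in rich}
    remaining = need
    for b in rich:
        # Never take more than the surplus, so u'_b stays at least mu.
        take = min(upload_slots[b] - mu_slots, remaining)
        reserved[b] = take
        remaining -= take
    return reserved


# Hypothetical system with s = 4 and mu = 2: u_b = 1, 1.5, 2, 3, 4.
upload_slots = [4, 6, 8, 12, 16]
mu_slots = 8
reserved = reserve_slots(upload_slots, mu_slots)
print(reserved)
```

A greedy fill keeps the sketch short; a balanced assignment would spread the same total over the rich boxes while preserving the constraint that each keeps at least $\mu s$ slots.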







Research unit INRIA Rocquencourt
Domaine de Voluceau - Rocquencourt - BP 105 - 78153 Le Chesnay Cedex (France)

Research unit INRIA Futurs: Parc Club Orsay Universite - ZAC des Vignes
4, rue Jacques Monod - 91893 ORSAY Cedex (France)

Research unit INRIA Lorraine: LORIA, Technopole de Nancy-Brabois - Campus scientifique
615, rue du Jardin Botanique - BP 101 - 54602 Villers-les-Nancy Cedex (France)

Research unit INRIA Rennes: IRISA, Campus universitaire de Beaulieu - 35042 Rennes Cedex (France)

Research unit INRIA Rhone-Alpes: 655, avenue de l'Europe - 38334 Montbonnot Saint-Ismier (France)

Research unit INRIA Sophia Antipolis: 2004, route des Lucioles - BP 93 - 06902 Sophia Antipolis Cedex (France)

Publisher
INRIA - Domaine de Voluceau - Rocquencourt, BP 105 - 78153 Le Chesnay Cedex (France)

http://www.inria.fr 
ISSN 0249-6399 



